RHOAIENG-30720: Remove GCS FT for Lifecycled RayCluster #892
@kryanbeane: This pull request references RHOAIENG-30720, which is a valid Jira issue.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@             Coverage Diff              @@
##   ray-jobs-feature    #892    +/-   ##
=============================================
+ Coverage     93.45%   93.48%   +0.03%
=============================================
  Files            21       21
  Lines          1910     1889      -21
=============================================
- Hits           1785     1766      -19
+ Misses          125      123       -2
RAY_VERSION = "2.47.1"
# Below references ray:2.47.1-py311-cu121
CUDA_RUNTIME_IMAGE = "quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e"
MOUNT_PATH = "/home/ray/scripts"
while i'm here, just moving the new mount path to a constant since it's referenced in so many places
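For illustration, a minimal sketch of how a single constant can feed both the volume mount and the entrypoint path. The helper names `build_script_volume_mount` and `script_entrypoint` are hypothetical and are not taken from this PR; only `MOUNT_PATH` and its value come from the diff above.

```python
# MOUNT_PATH and its value come from the diff; everything else is illustrative.
MOUNT_PATH = "/home/ray/scripts"


def build_script_volume_mount(volume_name: str) -> dict:
    """Return a Kubernetes volumeMount entry that mounts the scripts at MOUNT_PATH.

    Hypothetical helper: the real code may build this structure differently.
    """
    return {"name": volume_name, "mountPath": MOUNT_PATH}


def script_entrypoint(script_name: str) -> str:
    """Build an entrypoint command that runs a script from the mounted path.

    Hypothetical helper, shown only to illustrate reuse of the constant.
    """
    return f"python {MOUNT_PATH}/{script_name}"
```

With the path defined once, changing the mount location only requires editing `MOUNT_PATH`.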
lgtm. Great work! Thanks Bryan! :)
/approve
Looks good to me. Ran a sample test against a ROSA cluster successfully. Feel free to merge.
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: LilyLinh, pawelpaszki
Merged commit 416ba8d into project-codeflare:ray-jobs-feature
Issue link
RHOAIENG-30720
What changes have been made
Removed GCS FT from the Lifecycled RayJob implementation. Will discuss this with the team but my reasoning is as follows:
If GCS FT were enabled with a RayJob as the entrypoint and GCS failed, the following would happen:
RUNNING state and so the RayJob will be set to FAILED but Kuberay Operator
The only case where a RayJob will retry is if backoffLimit is set on the job. This signals to KubeRay how many times to spin up a new RayCluster for the job, but it does not try to recover the same RayCluster, which makes GCS FT useless here.
In the long-lived RayCluster scenario, if GCS fails and FT is enabled, the same RayCluster restarts rather than a new one being created, so GCS FT does provide value.
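To make the retry point concrete, here is a minimal sketch of a RayJob manifest built as a Python dict, with backoffLimit set and no GCS FT options on the cluster spec. The function name `build_rayjob_manifest` is hypothetical, and the field names reflect my understanding of the KubeRay RayJob CRD, so check them against the CRD version installed on the cluster.

```python
# Values taken from the constants in this PR's diff.
RAY_VERSION = "2.47.1"
CUDA_RUNTIME_IMAGE = "quay.io/modh/ray@sha256:6d076aeb38ab3c34a6a2ef0f58dc667089aa15826fa08a73273c629333e12f1e"


def build_rayjob_manifest(name: str, entrypoint: str, backoff_limit: int = 0) -> dict:
    """Build a RayJob manifest whose only retry mechanism is backoffLimit.

    Hypothetical helper. Field names are assumed from the KubeRay RayJob CRD.
    """
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            # Number of retries; each retry gets a new RayCluster.
            "backoffLimit": backoff_limit,
            "shutdownAfterJobFinishes": True,
            # Deliberately no gcsFaultToleranceOptions: the cluster is not
            # recovered in place, so GCS FT would add nothing here.
            "rayClusterSpec": {
                "rayVersion": RAY_VERSION,
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "ray-head", "image": CUDA_RUNTIME_IMAGE}
                            ]
                        }
                    },
                },
            },
        },
    }
```

For example, `build_rayjob_manifest("demo", "python /home/ray/scripts/train.py", backoff_limit=2)` describes a job that may be retried twice, each time on a fresh RayCluster.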
I will include a second PR to fix GCS FT for long-lived RayClusters.
Note: We could open an upstream issue to get GCS FT working for this case. It would require the KubeRay operator to recognise that GCS FT is enabled for the Lifecycled RayCluster and to restart that cluster instead of spinning up a new one. As far as I could see, this logic doesn't exist today.
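A rough sketch of the decision an upstream change might need to make is below. This is not KubeRay code: `RecoveryAction` and `choose_recovery_action` are hypothetical names, and the first branch is exactly the behaviour that, as far as I could see, does not exist today. It also assumes backoffLimit counts retries after the first attempt.

```python
from enum import Enum


class RecoveryAction(Enum):
    RESTART_SAME_CLUSTER = "restart_same_cluster"  # proposed behaviour, not implemented upstream
    CREATE_NEW_CLUSTER = "create_new_cluster"      # current retry path via backoffLimit
    MARK_FAILED = "mark_failed"


def choose_recovery_action(gcs_ft_enabled: bool, failed_attempts: int, backoff_limit: int) -> RecoveryAction:
    """Pick what to do after a GCS failure on a Lifecycled RayCluster.

    Hypothetical logic. failed_attempts counts failures so far, including
    the current one; backoff_limit is assumed to be the number of retries.
    """
    if gcs_ft_enabled:
        # Proposed: recover the existing cluster so GCS FT state is reused.
        return RecoveryAction.RESTART_SAME_CLUSTER
    if failed_attempts <= backoff_limit:
        # Current behaviour: retry on a brand-new RayCluster.
        return RecoveryAction.CREATE_NEW_CLUSTER
    return RecoveryAction.MARK_FAILED
```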
Verification steps
N/A